In this project, we use R and exploratory data analysis techniques to explore a selected data set, moving from single variables to relationships among multiple variables, and to look for distributions, outliers, and anomalies. The data set is about red wine quality. It contains some chemical properties for each wine. At least 3 wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (excellent). We want to determine which chemical properties influence the quality of red wines.
After loading the data, let's take a global view. The types of variables and some example values:
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The first 3 observations:
## id fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## quality
## 1 5
## 2 5
## 3 5
A global summary of the statistics:
## id fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
There are 1599 observations with 13 variables. The first one is the id of the observation. All variables are numerical. Some of them seem to have outliers.
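The overview above can be reproduced along the following lines. This is only a minimal sketch: the report does not give the file name, so the example writes the first three observations (and only a few of their columns) to a temporary CSV and reads them back. Renaming `X` to `id` is an assumption made here to reconcile the column names in the two outputs above.

```r
# Write a tiny CSV holding the first three observations shown above
# (only a few columns, to keep the sketch short). The real file name is
# not given in the report, so a temporary file stands in for it.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "X,fixed.acidity,volatile.acidity,alcohol,quality",
  "1,7.4,0.70,9.4,5",
  "2,7.8,0.88,9.8,5",
  "3,7.8,0.76,9.8,5"
), csv_path)

wine <- read.csv(csv_path)

# Assumption: the index column arrives as "X" and is renamed to "id",
# which would explain why the two outputs label it differently.
names(wine)[names(wine) == "X"] <- "id"

str(wine)       # variable types and example values
head(wine, 3)   # first observations
summary(wine)   # per-variable summary statistics
```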
First, let's explore the values of “quality”, our outcome variable:
The variable “quality” has only 6 different discrete values (3, 4, 5, 6, 7, 8), so it is converted to factor type.
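A conversion of this kind might look as follows. The sketch uses the first ten quality scores listed in the structure output; making the factor ordered is our assumption, since the text only says "factor".

```r
# First ten quality scores, as listed in the structure output above
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Fix the levels to the six scores present in the full data set (3 to 8),
# so that scores absent from a subset still show up with a zero count.
# Assumption: an ordered factor; the report only mentions "factor type".
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

# Frequency of each score
table(quality_f)
```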
Now let's explore the distribution of each of the other variables:
Most of the histograms are skewed right.
Volatile acidity seems to have a core of common values between 0.3 and 0.7. Let's see them as buckets:
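One way to build such buckets is `cut()`. The bucket edges below are hypothetical, chosen only to isolate the 0.3 to 0.7 core, and the values are the first ten volatile acidity readings from the structure output.

```r
# First ten volatile acidity values from the structure output above
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Hypothetical bucket edges; intervals are right-closed by default,
# so 0.7 falls into (0.3, 0.7]
buckets <- cut(volatile_acidity, breaks = c(0, 0.3, 0.7, Inf))

# Number of observations per bucket
bucket_counts <- table(buckets)
bucket_counts
```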
Around 25% of observations have a value for citric acid that is lower than 0.1. 50% of observations have a value for citric acid that is between 0.1 and 0.4:
Citric acid has one value of 1, probably a measurement error; excluding it, the largest value is around 0.8.
Free sulfur dioxide seems to have a group of very common values. Let's investigate further with some transformations (log10 and sqrt):
A log transformation of free.sulfur.dioxide reveals a more or less normal distribution. On the other hand, a sqrt transformation reveals that there are three common values around 6, 11 and 16.
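The two transformations can be computed side by side before plotting. This is a sketch on the first ten free sulfur dioxide values from the structure output; the histograms are only drawn in an interactive session.

```r
# First ten free sulfur dioxide values from the structure output above
free_so2 <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)

# Keep the raw and transformed values together for comparison
transformed <- data.frame(
  raw   = free_so2,
  log10 = log10(free_so2),
  sqrt  = sqrt(free_so2)
)
summary(transformed)

# Draw one histogram per column when a graphics device is available
if (interactive()) {
  for (column in names(transformed)) {
    hist(transformed[[column]], main = column, xlab = column)
  }
}
```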
Also it seems that regular wines (quality 5 or 6) tend to have higher values (14, 15) of free sulfur dioxide:
Now let's create a new variable, bound sulfur dioxide (nonfree.sulfur.dioxide = total.sulfur.dioxide - free.sulfur.dioxide), and compare it with free and total sulfur dioxide:
Bound sulfur dioxide (nonfree.sulfur.dioxide) tends to have slightly higher values than free sulfur dioxide. The distribution of total sulfur dioxide looks smoother across its range.
Now we are going to calculate the proportion of free sulfur dioxide (free over total). We call this new variable pfree.sulfur.dioxide:
The proportion of free sulfur dioxide has an almost normal distribution, with a mean around 0.4.
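Both derived variables follow directly from the definitions above. In this sketch, `pfree.sulfur.dioxide` is stored as a proportion between 0 and 1, which is our reading of the reported mean of about 0.4; the input values are the first ten observations from the structure output.

```r
# First ten observations of the two sulfur dioxide columns
so2 <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Bound sulfur dioxide: whatever part of the total is not free
so2$nonfree.sulfur.dioxide <- so2$total.sulfur.dioxide - so2$free.sulfur.dioxide

# Share of the total that is free (assumed to be a 0-1 proportion)
so2$pfree.sulfur.dioxide <- so2$free.sulfur.dioxide / so2$total.sulfur.dioxide

so2
```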
Regarding the alcohol variable, most of the observations have a value between 9 and 12, with a median of about 10:
Now let's compare the different distributions for each level of quality. For this, we consider 3 groups of qualities: bad (4 or lower), regular (5 or 6) and good (7 or higher). We create a new variable (class) indicating the group.
It seems that bad wines have higher volatile acidity, and they don’t have high citric acid values. They also tend to have lower sulphate values. Good wines tend to have more alcohol.
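The grouping can be expressed with `cut()`. The sketch below applies the three thresholds from the text to the six quality scores that occur in the data; the label strings are the class names used in this report.

```r
# The six quality scores that occur in the data set
quality <- 3:8

# bad: 4 or lower, regular: 5 or 6, good: 7 or higher
wine_class <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("bad", "regular", "good")
)

# Show which class each score maps to
data.frame(quality = quality, class = wine_class)
```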
The dataset is a tidy one and it has 1599 observations with 13 variables for each one. All of the variables are numerical. The first one is an index. The “quality” variable has only 6 discrete values: 3, 4, 5, 6, 7, 8.
Since “quality” is the outcome, the variables “volatile.acidity”, “citric.acid”, “sulphates” and “alcohol” seem to be interesting. The distributions for these variables tend to be different across levels of “quality”.
The free sulfur dioxide variable might also help to predict the outcome. We will further investigate this variable and others derived from it.
When we use the variable “alcohol”, its values are rounded to integers. The outcome “quality” was converted to factor type with 6 levels.
We also created three new variables: “nonfree.sulfur.dioxide” (bound sulfur dioxide), “pfree.sulfur.dioxide” (the proportion of free sulfur dioxide) and “class” (the quality group: bad, regular or good).
Most of the features have outliers that are far beyond the 3rd quartile in their distributions. This may also be one of the reasons why most of them are right skewed.
Volatile acidity seems to have a core of common values between 0.3 and 0.7. Half of the citric acid values are between 0.09 and 0.42. Citric acid has an outlier value of 1, probably a measurement error; excluding it, the largest value is 0.79. There are very few observations with a “quality” value other than 5 or 6.
For free sulfur dioxide we detected a very common value between 5 and 6. A logarithmic transformation gave us a distribution more similar to a normal one. Also we tried to perform a sqrt transformation and we detected common values around 6, 11 and 16. The median values for regular wines (quality 5 or 6) are higher than the median values for other qualities (bad and good, which have similar free sulfur dioxide median values).
Regarding the new variables, bound sulfur dioxide (“nonfree.sulfur.dioxide”) tends to have higher values than free sulfur dioxide. The proportion of free sulfur dioxide (“pfree.sulfur.dioxide”) has an almost normal distribution, with a mean around 0.4.
Most of the observations have an alcohol value between 9 and 12, with a median of 10. It is strange that wines with a quality of 5 tend to have less alcohol.
As explained above, the values of the “alcohol” variable are rounded to integers and the outcome (“quality”) was converted to factor type. There are no other notable changes in the data.
We check Pearson’s correlation between all pairs of variables. We can see it in a graphical way using the psych package:
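A numerical version of the same check can be obtained with base R's `cor()`. This is a sketch on the first ten observations of five variables, so the coefficients it prints will not match those of the full data set. We assume that `psych::pairs.panels()` is the kind of plot meant above; it is only called when the package is installed and the session is interactive.

```r
# First ten observations of the four main features plus quality
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations between all columns
cor_matrix <- cor(wine, method = "pearson")
round(cor_matrix, 2)

# Optional graphical view (assumption: this is the psych plot referred to)
if (interactive() && requireNamespace("psych", quietly = TRUE)) {
  psych::pairs.panels(wine)
}
```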
As suspected, our initial guess about the main features is consistent with the correlation values obtained. The features “volatile.acidity”, “citric.acid”, “sulphates” and “alcohol” show the largest correlations with “quality”, with absolute values ranging from 0.23 to 0.48. In the case of “volatile.acidity”, the correlation is negative.
Let's examine these variables with some boxplots by quality classes:
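A numeric counterpart of such box plots is the median per class. The sketch below uses the first ten observations only, none of which is a bad wine, so that class comes out as `NA`. The box plot itself is drawn only in an interactive session.

```r
# First ten alcohol and quality values from the structure output
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Same three classes as defined earlier in the report
wine$class <- cut(wine$quality, breaks = c(-Inf, 4, 6, Inf),
                  labels = c("bad", "regular", "good"))

# Median alcohol per class (NA for classes without observations)
alcohol_medians <- tapply(wine$alcohol, wine$class, median)
alcohol_medians

if (interactive()) {
  boxplot(alcohol ~ class, data = wine)
}
```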
The box plots corroborate our previous findings. We see a clear positive tendency in all of them (except volatile.acidity, where it is negative). In the case of alcohol, we cannot see a difference between bad and regular wines. The alcohol values for regular wines seem to be widely spread. Let's examine them using the quality values directly (5 and 6):
There is a jump in the alcohol variable between qualities 5 and 6. Maybe this is a separation between potentially bad wines and potentially good wines.
Based on our previous analysis, we have been checking some correlations. Box plots for each quality level have shown a tendency for these variables: “volatile.acidity”, “citric.acid”, “sulphates” and “alcohol”. All of them except “volatile.acidity” are positively correlated with quality. This is expected, because “volatile.acidity” is the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste. For a “quality” value of 5, the “alcohol” values are widely spread, although the tendency is that good wines (quality 7 or 8) have the highest median level of alcohol.
Furthermore, correlation matrices have given us a global overview of all pairwise relations, both numerically and graphically.
According to the correlation matrix, some other features are highly correlated with each other. These correlations are even higher than those discussed above (regarding “quality”). Variable “fixed.acidity” is correlated with “density”, “citric.acid” and “pH”. Of course the correlation is negative for “pH”, because a low pH indicates a very acidic environment. This is the reason why “citric.acid” and “pH” are also negatively correlated.
As expected, “citric.acid” is also negatively correlated with “volatile.acidity”. In its correlations, “volatile.acidity” behaves more or less as the opposite of “fixed.acidity”.
The negative effect of alcohol on the “density” variable is stronger than the positive effect of “residual.sugar”.
Finally, as expected, free sulfur dioxide is correlated with total sulfur dioxide. And of course the new variables “nonfree.sulfur.dioxide” and “pfree.sulfur.dioxide” are related to the previous ones.
The strongest relationship, ignoring that between “total.sulfur.dioxide” and “nonfree.sulfur.dioxide”, is the negative correlation (-0.68) between “fixed.acidity” and “pH”. As discussed before, this is to be expected.
Let's examine and compare the combinations of our 4 main features, using the quality of the wine as colour.
For prediction purposes, we have two main problems: 1) unbalanced observation types (too many regular wines) and 2) the regular wines are spread widely across feature values, so they are mixed with the bad and good classes. Maybe we should try to predict good (or bad) wines rather than classify into the three classes. Let's check only bad wines against good wines. In this case, we also add some 2D density maps in order to see where the clusters or groups are located for each combination of features:
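Selecting only the two extreme classes is a one-line subset. The sketch below shows it on a toy data frame holding just the six possible quality scores; the object names are ours, not taken from the original analysis.

```r
# Toy data frame: one row per quality score that occurs in the data
wine <- data.frame(quality = 3:8)
wine$class <- cut(wine$quality, breaks = c(-Inf, 4, 6, Inf),
                  labels = c("bad", "regular", "good"))

# Keep only bad and good wines, then drop the now-empty "regular" level
# so that it does not appear in legends or tables
extremes <- droplevels(subset(wine, class != "regular"))

table(extremes$class)
```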
If we select only “bad” and “good” wines, we can see that most good wines have medium values of citric acid and low values of volatile acidity. On the other hand, bad wines usually have medium-high volatile acidity and low citric acid. This is similar for combinations of “volatile.acidity” with “sulphates” or “alcohol”: good wines are upper left and bad wines are lower right. The tendency is similar in “alcohol” vs “citric.acid” or “sulphates”, although in this case good wines are on the upper right and bad wines on the lower left. For the combination “citric.acid” vs “sulphates”, we can see more or less a horizontal line separating good and bad wines:
And finally, let's create 4 simple linear models using our four main features. The first model includes only “alcohol” as predictor. The next model adds “volatile.acidity”. Model 3 also adds “sulphates”, and the last model adds “citric.acid”. We tested different combinations (data not shown) to find this one:
##
## Calls:
## m1: lm(formula = I(quality_num ~ alcohol), data = data)
## m2: lm(formula = quality_num ~ alcohol + volatile.acidity, data = data)
## m3: lm(formula = quality_num ~ alcohol + volatile.acidity + sulphates,
## data = data)
## m4: lm(formula = quality_num ~ alcohol + volatile.acidity + sulphates +
## citric.acid, data = data)
##
## ==================================================================
##                    m1          m2          m3          m4
## ------------------------------------------------------------------
## (Intercept)        1.875***    3.095***    2.611***    2.646***
##                   (0.175)     (0.184)     (0.196)     (0.201)
## alcohol            0.361***    0.314***    0.309***    0.309***
##                   (0.017)     (0.016)     (0.016)     (0.016)
## volatile.acidity              -1.384***   -1.221***   -1.265***
##                               (0.095)     (0.097)     (0.113)
## sulphates                                  0.679***    0.696***
##                                           (0.101)     (0.103)
## citric.acid                                           -0.079
##                                                       (0.104)
## ------------------------------------------------------------------
## R-squared          0.227       0.317       0.336       0.336
## adj. R-squared     0.226       0.316       0.335       0.334
## sigma              0.710       0.668       0.659       0.659
## F                  468.267     370.379     268.912     201.777
## p                  0.000       0.000       0.000       0.000
## Log-likelihood    -1721.057   -1621.814   -1599.384   -1599.093
## Deviance           805.870     711.796     692.105     691.852
## AIC                3448.114    3251.628    3208.768    3210.186
## BIC                3464.245    3273.136    3235.654    3242.448
## N                  1599        1599        1599        1599
## ==================================================================
We have seen that “alcohol” is the most important feature, followed by “volatile.acidity”. The “sulphates” feature adds a small improvement, but “citric.acid” does not improve the model (something we already suspected thanks to the scatter plots).
Although the models are not very good (the R² values are low, 0.336 in the best case, model 3), the predictions are reasonable. If we use rounded predicted quality values, we correctly predict 58% of the qualities. But if we use quality classes (bad, regular and good), the success rate increases to 83%:
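The nested structure of the four models can be sketched with `lm()` and `update()`. The example fits them on the first ten observations only, so its coefficients are not meaningful and will not match the table above. The data frame name `wine` is ours (the calls above use `data`), and `quality_num` is assumed to be the numeric version of the quality score.

```r
# First ten observations of the four predictors and the numeric outcome
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality_num      = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Each model adds one predictor to the previous one
m1 <- lm(quality_num ~ alcohol, data = wine)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + sulphates)
m4 <- update(m3, . ~ . + citric.acid)

# Compare fit statistics across the nested models
models <- list(m1 = m1, m2 = m2, m3 = m3, m4 = m4)
fit_stats <- data.frame(
  r.squared = sapply(models, function(m) summary(m)$r.squared),
  AIC       = sapply(models, AIC)
)
fit_stats
```

R-squared can never decrease when a predictor is added, which is why AIC, BIC and adjusted R-squared are the more useful columns when deciding whether “citric.acid” earns its place.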
## [1] 0.5822389
## [1] 0.833646
But there is an important thing to note. Our dataset is very unbalanced: 82% of quality values are 5 or 6 (class regular). So a dummy model that always predicts the “regular” class would achieve a success rate of 82%, and our 83% is barely better. With quality numbers, almost 43% of the values are “5”, so a success rate of 58% is only a modest improvement over always predicting 5.
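The comparison against a dummy model can be made explicit by computing the majority-class baseline. The helper names below (`accuracy`, `to_class`) are hypothetical, and the sketch uses the first ten quality scores rather than the full data, so its baselines (70% and 80%) differ from the 43% and 82% quoted above.

```r
# First ten quality scores from the structure output
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Hypothetical helpers: share of correct predictions, and score -> class
accuracy <- function(predicted, actual) mean(predicted == actual)
to_class <- function(q) {
  as.character(cut(q, breaks = c(-Inf, 4, 6, Inf),
                   labels = c("bad", "regular", "good")))
}

# Baseline 1: always predict the most frequent quality score
majority_quality <- as.numeric(names(which.max(table(quality))))
baseline_quality <- accuracy(rep(majority_quality, length(quality)), quality)

# Baseline 2: always predict the most frequent class
classes <- to_class(quality)
majority_class <- names(which.max(table(classes)))
baseline_class <- accuracy(rep(majority_class, length(classes)), classes)

c(quality = baseline_quality, class = baseline_class)

# A fitted model would be scored in the same way, for example
# accuracy(round(predict(model)), quality) for a hypothetical `model`.
```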
Since the outcome “quality” is a little subjective, and also it is the median of several evaluators, we have thought that it is better to take into account only 3 categories or classes: bad wines (qualities of 4 or lower), regular wines (qualities 5 or 6) and good wines (qualities of 7 and higher). It will help us to see the differences regarding their features.
So we have compared our 4 main variables (“volatile.acidity”, “citric.acid”, “sulphates” and “alcohol”) in a pairwise mode, taking into account the different “classes” of wine. Here we have seen that regular wines are widely spread; most of the time there is no clear boundary between a bad and a regular wine, or between a good and a regular wine. On the other hand, bad wines and good wines are easier to tell apart.
What we have seen is that most good wines have medium values of citric acid and low values of volatile acidity. Bad wines usually have medium-high volatile acidity and low citric acid. This is similar for combinations of “volatile.acidity” with “sulphates” or “alcohol”: good wines are upper left and bad wines are lower right. The tendency is similar in “alcohol” vs “citric.acid” or “sulphates”, although in this case good wines are on the upper right and bad wines on the lower left. For the combination “citric.acid” vs “sulphates”, we can see more or less a horizontal line separating good and bad wines.
According to our previous bivariate analysis, there seems to be a positive correlation between “citric.acid” and “quality”. But if we look at the scatter plots by class of wine (only good and bad), we do not see a clear “citric.acid” cutoff that distinguishes good from bad wines. The separation is therefore driven by the other variables.
As explained before, we created 4 simple linear models using our four main features. The first model includes only “alcohol” as predictor. The next model adds “volatile.acidity”. Model 3 also adds “sulphates”, and the last model adds “citric.acid”. The R² values of our models are not very good, and the success rates can be a little misleading. One of the main problems is that we have a very unbalanced dataset (too many “regular” wines). Maybe the biggest problem for the model is to distinguish between bad and regular wines, and between good and regular wines.
This plot shows the densities for the distributions of all features in the dataset. They are grouped according to the three quality classes for wine: bad (4 or lower quality values) in red, regular (5 or 6) in green and good (7 or higher) in blue. Those variables with less overlapping in their density curves could help us to distinguish between quality classes. Four of the best features for this purpose are: volatile acidity, citric acid, sulphates and alcohol. Other variables also could help us to detect a specific class, like fixed acidity (good wines) and % free sulfur dioxide (regular wines).
Note: text, values and ticks of Y-axis were removed for clarity
In this case, we analyse the main four features (volatile acidity, citric acid, sulphates and alcohol) with box plots by quality class of wines (bad, regular and good) using the same schema of colours. The box plots also show the mean values as a circle. For 3 of 4 features (all except volatile acidity), median (and mean) values increase with quality; although in the case of alcohol, the classes “bad” and “regular” are very similar, with the exception of some outliers in regular class. In general, the regular class values are very spread. For volatile acidity, we see a negative tendency: values of quality are higher for lower values of this feature.
Note: some outliers were removed (>= 14 for alcohol and >= 1.5 for sulphates)
In this plot we show the pairwise comparison for the six combinations of the four main features. Each combination is represented in a scatter plot. We used a subset of the wines dataset, selecting only wines with quality class bad or good. We also deleted some outliers (volatile acidity >= 1.5, citric acid >= 1 and sulphates >= 2). The idea is to show that these features could help to distinguish good wines from bad wines. We omit regular wines because their features are so widely spread that it is not easy to make a distinction; besides, a person is usually not interested in detecting a regular wine, but rather wants to detect a potentially good wine or to avoid a bad one.
These scatter plots also show density 2D maps for each class. This allows us to see regions or clusters of good wine and bad wine.
We have been analysing a red wine dataset with 1,599 observations and 12 features. One of these features is the score or quality of the wine. The objective was to analyse the other features to determine their influence on wine quality. After studying the distributions of the features, taking the qualities into account, we identified four features as the most influential: volatile acidity, citric acid, sulphates and alcohol. After grouping the qualities into three classes (bad, regular and good), we saw that there was a correlation with the main features. This correlation is positive in all cases except volatile acidity, whose correlation is negative. Multivariate analysis allowed us to see that combinations of the main features could help to determine different “spatial” regions for good wines and bad wines. We have decided that predicting regular wines does not make much sense: most people want to detect a potentially good wine (or avoid a bad wine).
According to our study, good wines seem to have lower volatile acidity, higher alcohol and medium-high sulphate values. Bad wines tend to have low values for citric acid, although, as we have seen, this feature does not improve our predictive models.
Regarding these predictive models, we have been trying a simple linear model with only one main feature, and then adding one by one the other 3 main features. Although the R2 is small, the success rates are more or less high. But this is mainly because we have a problem of unbalanced data: too many “regular” class observations.
In future work, we should try to improve our modelling procedure by balancing the data and using cross-validation techniques to detect overfitting. We could also try an algorithm for feature selection.
Other machine learning algorithms could work better for this problem. Decision trees could be useful to detect a path of rules to determine wine quality. Also classification algorithms could be used since quality is in fact an ordered categorical variable. There are more powerful methods, like random forest or Support Vector Machines (SVM); they could help us to get good predictors, but it would be more complicated to interpret the resulting models. k-Nearest Neighbours algorithm (k-NN) could work very well in this context, but it will not explain anything about the underlying model.